---
title: "HierarchicalDocumentSplitter"
id: hierarchicaldocumentsplitter
slug: "/hierarchicaldocumentsplitter"
description: "Use this component to create a multi-level document structure based on parent-children relationships between text segments."
---

# HierarchicalDocumentSplitter

Use this component to create a multi-level document structure based on parent-children relationships between text segments.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx)   and [`DocumentCleaner`](documentcleaner.mdx)                                                                                                                                                                                                              |
| **Mandatory init variables**           | `block_sizes`: Set of block sizes to split the document into. The blocks are split in descending order.                                                                                                                                                                                                                  |
| **Mandatory run variables**            | `documents`: A list of documents to split into hierarchical blocks                                                                                                                                                                                                                                                       |
| **Output variables**                   | `documents`: A list of hierarchical documents                                                                                                                                                                                                                                                                            |
| **API reference**                      | [PreProcessors](/reference/preprocessors-api)                                                                                                                                                                                                                                                                                   |
| **GitHub link**                        | [https://github.com/deepset-ai/haystack/blob/dae8c7babaf28d2ffab4f2a8dedecd63e2394fb4/haystack/components/preprocessors/hierarchical_document_splitter.py](https://github.com/deepset-ai/haystack/blob/dae8c7babaf28d2ffab4f2a8dedecd63e2394fb4/haystack/components/preprocessors/hierarchical_document_splitter.py#L12) |

</div>

## Overview

The `HierarchicalDocumentSplitter` divides documents into blocks of different sizes, creating a tree-like structure.

A block is one of the chunks of text that the splitter produces. It is similar to cutting a long piece of text into smaller pieces: each piece is a block. Blocks form a tree structure where your full document is the root block, and as you split it into smaller and smaller pieces, you get child blocks and leaf blocks, down to the smallest size you specified.

The [`AutoMergingRetriever`](../retrievers/automergingretriever.mdx) component then leverages this hierarchical structure to improve document retrieval.

To initialize the component, you need to specify `block_sizes`, which sets the maximum length of the blocks at each level, measured in the unit chosen with the `split_by` parameter. Pass a set of sizes (for example, `{20, 5}`), and the component will:

- First, split the document into blocks of up to 20 units each (the “parent” blocks).
- Then, split each of those into blocks of up to 5 units each (the “child” blocks).

This descending order of sizes builds the hierarchy.
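
The two steps above can be sketched in plain Python. This is a simplified illustration of the idea for word-based splitting without overlap, not the component's actual implementation; the `build_hierarchy` helper and the dictionary layout are hypothetical.

```python
def build_hierarchy(text, block_sizes):
    """Split `text` by words into nested blocks, largest block size first."""
    # Sets are unordered, so sort the sizes in descending order explicitly.
    sizes = sorted(block_sizes, reverse=True)
    root = {"content": text, "level": 0, "block_size": 0, "children": []}

    def split(node, remaining_sizes):
        if not remaining_sizes:
            return  # smallest size reached: this node is a leaf block
        size = remaining_sizes[0]
        words = node["content"].split()
        for start in range(0, len(words), size):
            child = {
                "content": " ".join(words[start:start + size]),
                "level": node["level"] + 1,
                "block_size": size,
                "children": [],
            }
            node["children"].append(child)
            # Recurse with the next, smaller block size.
            split(child, remaining_sizes[1:])

    split(root, sizes)
    return root


tree = build_hierarchy("This is a simple test document", {3, 2})
for parent in tree["children"]:
    print(parent["content"], "->", [c["content"] for c in parent["children"]])
# This is a -> ['This is', 'a']
# simple test document -> ['simple test', 'document']
```

Unlike the real component, this sketch drops trailing whitespace and does not assign IDs or metadata.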

These additional parameters can be set when the component is initialized:

- `split_by` can be `"word"` (default), `"sentence"`, `"passage"`, `"page"`.
- `split_overlap` is an integer that sets how many units (words, sentences, passages, or pages, depending on `split_by`) consecutive chunks share. The default is 0.
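
To make `split_overlap` more concrete, here is a simplified sketch of overlapping word windows. The `overlapping_chunks` helper is hypothetical and only illustrates the concept; the component may treat edge cases, such as a short final chunk, differently.

```python
def overlapping_chunks(units, size, overlap=0):
    """Yield chunks of up to `size` units; consecutive chunks share `overlap` units."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    for start in range(0, len(units), step):
        yield units[start:start + size]
        if start + size >= len(units):
            break  # this chunk already reaches the end of the input


words = "one two three four five six seven".split()
chunks = list(overlapping_chunks(words, size=3, overlap=1))
print(chunks)
# [['one', 'two', 'three'], ['three', 'four', 'five'], ['five', 'six', 'seven']]
```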

## Usage

### On its own

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])

>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```
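
The `level`, `parent_id`, and `children_ids` metadata fields in the output above describe the tree. As a rough sketch of how they can be used, the following snippet separates leaf blocks from the rest and walks from a leaf back to the root. Plain dictionaries with made-up IDs stand in for `Document` objects here; with real documents you would read `doc.id` and `doc.meta` instead.

```python
# Stand-ins for the splitter output; the IDs are invented for illustration.
docs = [
    {"id": "root", "content": "This is a simple test document",
     "meta": {"parent_id": None, "children_ids": ["p1", "p2"], "level": 0}},
    {"id": "p1", "content": "This is a ",
     "meta": {"parent_id": "root", "children_ids": ["c1", "c2"], "level": 1}},
    {"id": "p2", "content": "simple test document",
     "meta": {"parent_id": "root", "children_ids": ["c3", "c4"], "level": 1}},
    {"id": "c1", "content": "This is ",
     "meta": {"parent_id": "p1", "children_ids": [], "level": 2}},
    {"id": "c2", "content": "a ",
     "meta": {"parent_id": "p1", "children_ids": [], "level": 2}},
    {"id": "c3", "content": "simple test ",
     "meta": {"parent_id": "p2", "children_ids": [], "level": 2}},
    {"id": "c4", "content": "document",
     "meta": {"parent_id": "p2", "children_ids": [], "level": 2}},
]

by_id = {d["id"]: d for d in docs}

# Leaf blocks are the ones without children.
leaves = [d for d in docs if not d["meta"]["children_ids"]]


def ancestors(doc):
    """Return the IDs of all ancestors of `doc`, from direct parent up to the root."""
    chain = []
    parent_id = doc["meta"]["parent_id"]
    while parent_id is not None:
        parent = by_id[parent_id]
        chain.append(parent["id"])
        parent_id = parent["meta"]["parent_id"]
    return chain


print([d["id"] for d in leaves])   # ['c1', 'c2', 'c3', 'c4']
print(ancestors(by_id["c3"]))      # ['p2', 'root']
```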

### In a pipeline

This Haystack pipeline processes `.md` files by converting them to documents, cleaning the text, splitting it into hierarchical, sentence-based blocks, and storing the results in an In-Memory Document Store.

```python
from pathlib import Path

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

# Build the indexing pipeline: convert -> clean -> split -> write
pipeline = Pipeline()
pipeline.add_component(instance=TextFileToDocument(), name="text_file_converter")
pipeline.add_component(instance=DocumentCleaner(), name="cleaner")
pipeline.add_component(
    instance=HierarchicalDocumentSplitter(block_sizes={10, 6, 3}, split_overlap=0, split_by="sentence"),
    name="splitter",
)
pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
pipeline.connect("text_file_converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
pipeline.connect("splitter.documents", "writer.documents")

# Collect the Markdown files and run the pipeline on them
path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
pipeline.run({"text_file_converter": {"sources": files}})
```
